The systematic collection of speech corpora for all eleven official South African languages

Authors

  • Marissa van Rooyen
  • Cecile van Zyl
  • Nico Oosthuizen

Abstract

In this paper we outline methods and best practices for collecting speech data for under-resourced languages. The discussion focuses on ways of improving the quality of the collection and its turnaround time. We show how to deal with matters concerning assistants and technical problems, and suggest techniques with which data management may be optimised. The paper aims to give the reader a complete overview of improvements made during the course of a real data collection project with tangible problems and results.

Similar resources

Spelling Checker-based Language Identification for the Eleven Official South African Languages

Language identification is often the first step when compiling corpora from web pages or other unstructured sources. In this paper, an effective and accurate method for identification of all eleven official South African languages is presented. The method is based on reusing commercial spelling checkers and consists of a multi-stage architecture that is described in detail. We describe the impl...

Collecting and evaluating speech recognition corpora for 11 South African languages

We describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which contains data from the eleven official languages of South Africa. Because of practical constraints, the amount of speech per language is relatively small compared to major corpora in world languages, and we report on our investigation of the stability of the ASR models derived from the corpu...

The NCHLT speech corpus of the South African languages

The NCHLT speech corpus contains wide-band speech from approximately 200 speakers per language, in each of the eleven official languages of South Africa. We describe the design and development processes that were undertaken in order to develop the corpus, and report on associated materials such as orthographic transcriptions and pronunciation dictionaries that were released as part of the corpu...

Rapid Development of TTS Corpora for Four South African Languages

This paper describes the development of text-to-speech corpora for four South African languages. The approach followed investigated the possibility of using low-cost methods including informal recording environments and untrained volunteer speakers. This objective and the additional future goal of expanding the corpus to increase coverage of South Africa’s 11 official languages necessitated exp...

African speech technology (AST) telephone speech databases: corpus design and contents

The African Speech Technology project is developing telephone speech databases for five of South Africa’s eleven official languages, i.e. South African English, Afrikaans, and three African languages, Zulu, Xhosa, and Southern Sotho. These databases will be fully transcribed – orthographically and phonetically – and will be used for the training and testing of phoneme-based, speaker-independent...

Publication date: 2008